{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第05课：面向非结构化数据转换的词袋和词向量模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "词袋模型（Bag of Words Model）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "#定义停用词、标点符号\n",
    "punctuation = [\"，\",\"。\", \"：\", \"；\", \"？\"]\n",
    "#定义语料\n",
    "content = [\"机器学习带动人工智能飞速的发展。\",\n",
    "           \"深度学习带动人工智能飞速的发展。\",\n",
    "           \"机器学习和深度学习带动人工智能飞速的发展。\"\n",
    "          ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /var/folders/yw/k7z_d_3567g16ss9plk47x9w0000gn/T/jieba.cache\n",
      "Loading model cost 0.640 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['机器', '学习', '带动', '人工智能', '飞速', '的', '发展', '。'], ['深度', '学习', '带动', '人工智能', '飞速', '的', '发展', '。'], ['机器', '学习', '和', '深度', '学习', '带动', '人工智能', '飞速', '的', '发展', '。']]\n"
     ]
    }
   ],
   "source": [
    "#分词\n",
    "segs_1 = [jieba.lcut(con) for con in content]\n",
    "print(segs_1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['机器', '学习', '带动', '人工智能', '飞速', '的', '发展'], ['深度', '学习', '带动', '人工智能', '飞速', '的', '发展'], ['机器', '学习', '和', '深度', '学习', '带动', '人工智能', '飞速', '的', '发展']]\n"
     ]
    }
   ],
   "source": [
    "tokenized = []\n",
    "for sentence in segs_1:\n",
    "    words = []\n",
    "    for word in sentence:\n",
    "        if word not in punctuation:          \n",
    "            words.append(word)\n",
    "    tokenized.append(words)\n",
    "print(tokenized)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['，', '。', '：', '；', '？']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "punctuation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['带动', '机器', '发展', '的', '学习', '深度', '和', '人工智能', '飞速']\n"
     ]
    }
   ],
   "source": [
    "#求并集\n",
    "bag_of_words = [ x for item in segs_1 for x in item if x not in punctuation]\n",
    "#去重\n",
    "bag_of_words = list(set(bag_of_words))\n",
    "print(bag_of_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bag_of_word2vec = []\n",
    "for sentence in tokenized:\n",
    "    tokens = [1 if token in sentence else 0 for token in bag_of_words ]\n",
    "    bag_of_word2vec.append(tokens)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[1, 1, 1, 1, 1, 0, 0, 1, 1],\n",
       " [1, 0, 1, 1, 1, 1, 0, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1, 1]]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bag_of_word2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dictionary(9 unique tokens: ['人工智能', '发展', '学习', '带动', '机器']...)\n"
     ]
    }
   ],
   "source": [
    "from gensim import corpora\n",
    "import gensim\n",
    "#tokenized是去标点之后的\n",
    "dictionary = corpora.Dictionary(tokenized)\n",
    "#保存词典\n",
    "dictionary.save('deerwester.dict') \n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'人工智能': 0, '发展': 1, '学习': 2, '带动': 3, '机器': 4, '的': 5, '飞速': 6, '深度': 7, '和': 8}\n"
     ]
    }
   ],
   "source": [
    "#查看词典和下标 id 的映射\n",
    "print(dictionary.token2id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 词向量 （Word Embedding）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用 One-Hot Encoder有以下问题：\n",
    "\n",
    "第一，词语编码是随机的，向量之间相互独立，看不出词语之间可能存在的关联关系。\n",
    "第二，向量维度的大小取决于语料库中词语的多少，如果语料包含的所有词语对应的向量合为一个矩阵的话，那这个矩阵过于稀疏，并且会造成维度灾难。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "而解决这个问题的手段，就是使用向量表示（Vector Representations）。比如 Word2Vec 可以将 One-Hot Encoder 转化为低维度的连续值，也就是稠密向量，并且其中意思相近的词也将被映射到向量空间中相近的位置。经过降维，在二维空间中，相似的单词在空间中的距离也很接近。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Word2Vec 主要包含两种模型：Skip-Gram 和 CBOW，值得一提的是，Word2Vec 词向量可以较好地表达不同词之间的相似和类比关系。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.models import Word2Vec  \n",
    "import jieba\n",
    "#定义停用词、标点符号\n",
    "punctuation = [\",\",\"。\", \":\", \";\", \".\", \"'\", '\"', \"’\", \"?\", \"/\", \"-\", \"+\", \"&\", \"(\", \")\"]\n",
    "sentences = [\n",
    "\"长江是中国第一大河，干流全长6397公里（以沱沱河为源），一般称6300公里。流域总面积一百八十余万平方公里，年平均入海水量约九千六百余亿立方米。以干流长度和入海水量论，长江均居世界第三位。\",\n",
    "\"黄河，中国古代也称河，发源于中华人民共和国青海省巴颜喀拉山脉，流经青海、四川、甘肃、宁夏、内蒙古、陕西、山西、河南、山东9个省区，最后于山东省东营垦利县注入渤海。干流河道全长5464千米，仅次于长江，为中国第二长河。黄河还是世界第五长河。\",\n",
    "\"黄河,是中华民族的母亲河。作为中华文明的发祥地,维系炎黄子孙的血脉.是中华民族民族精神与民族情感的象征。\",\n",
    "\"黄河被称为中华文明的母亲河。公元前2000多年华夏族在黄河领域的中原地区形成、繁衍。\",\n",
    "\"在兰州的“黄河第一桥”内蒙古托克托县河口镇以上的黄河河段为黄河上游。\",\n",
    "\"黄河上游根据河道特性的不同，又可分为河源段、峡谷段和冲积平原三部分。 \",\n",
    "\"黄河,是中华民族的母亲河。\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentences = [jieba.lcut(sen) for sen in sentences]\n",
    "tokenized = []\n",
    "for sentence in sentences:\n",
    "    words = []\n",
    "    for word in sentence:\n",
    "        if word not in punctuation:          \n",
    "            words.append(word)\n",
    "    tokenized.append(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    " model = Word2Vec(tokenized, sg=1, size=100,  window=5,  min_count=2,  negative=1, sample=0.001, hs=1, workers=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sg=1 是 skip-gram 算法，对低频词敏感；默认 sg=0 为 CBOW 算法。\n",
    "\n",
    "size 是输出词向量的维数，值太小会导致词映射因为冲突而影响结果，值太大则会耗内存并使算法计算变慢，一般值取为100到200之间。\n",
    "\n",
    "window 是句子中当前词与目标词之间的最大距离，3表示在目标词前看3-b 个词，后面看 b 个词（b 在0-3之间随机）。\n",
    "\n",
    "min_count 是对词进行过滤，频率小于 min-count 的单词则会被忽视，默认值为5。\n",
    "\n",
    "negative 和 sample 可根据训练结果进行微调，sample 表示更高频率的词被随机下采样到所设置的阈值，默认值为 1e-3。\n",
    "\n",
    "hs=1 表示层级 softmax 将会被使用，默认 hs=0 且 negative 不为0，则负采样将会被选择使用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model.save('model')  #保存模型\n",
    "model = Word2Vec.load('model')   #加载模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.99999994\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    }
   ],
   "source": [
    "print(model.similarity('黄河', '黄河'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.031182129\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    }
   ],
   "source": [
    "print(model.similarity('黄河', '长江'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('中华民族', 0.16356922686100006), ('、', 0.15501049160957336), ('公里', 0.14551666378974915), ('是', 0.12099534273147583), ('于', 0.11948922276496887), ('河道', 0.11142636835575104), ('，', 0.061772894114255905), ('水量', 0.05794735625386238), ('长河', 0.05365785211324692), ('第一', 0.047971852123737335)]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n",
      "  \n"
     ]
    }
   ],
   "source": [
    "# 预测最接近的词，预测与黄河和母亲河最接近，而与长江不接近的词\n",
    "print(model.most_similar(positive=['黄河', '母亲河'], negative=['长江']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Word2Vec 是一种将词变成词向量的工具。通俗点说，只有这样文本预料才转化为计算机能够计算的矩阵向量。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Doc2Vec 是 Mikolov 在 Word2Vec 基础上提出的另一个用于计算长文本向量的工具。在 Gensim 库中，Doc2Vec 与 Word2Vec 都极为相似。但两者在对输入数据的预处理上稍有不同，Doc2vec 接收一个由 LabeledSentence 对象组成的迭代器作为其构造函数的输入参数。其中，LabeledSentence 是 Gensim 内建的一个类，它接收两个 List 作为其初始化的参数：word list 和 label list。\n",
    "\n",
    "Doc2Vec 也包括两种实现方式：DBOW（Distributed Bag of Words）和 DM （Distributed Memory）。DBOW 和 DM 的实现，二者在 gensim 库中的实现用的是同一个方法，该方法中参数 dm = 0 或者 dm=1 决定调用 DBOW 还是 DM。Doc2Vec 将文档语料通过一个固定长度的向量表达。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "#定义数据预处理类，作用是给每个文章添加对应的标签\n",
    "from gensim.models.doc2vec import Doc2Vec,LabeledSentence\n",
    "doc_labels = [\"长江\",\"黄河\",\"黄河\",\"黄河\",\"黄河\",\"黄河\",\"黄河\"]\n",
    "class LabeledLineSentence(object):\n",
    "    def __init__(self, doc_list, labels_list):\n",
    "       self.labels_list = labels_list\n",
    "       self.doc_list = doc_list\n",
    "    def __iter__(self):\n",
    "        for idx, doc in enumerate(self.doc_list):\n",
    "            yield LabeledSentence(words=doc,tags=[self.labels_list[idx]])\n",
    "\n",
    "   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iter_data = LabeledLineSentence(tokenized, doc_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py:580: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.\n",
      "  warnings.warn(\"The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.\")\n",
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:10: DeprecationWarning: Call to deprecated `LabeledSentence` (Class will be removed in 4.0.0, use TaggedDocument instead).\n",
      "  # Remove the CWD from sys.path while we load stuff.\n"
     ]
    }
   ],
   "source": [
    "model = Doc2Vec(dm=1, size=100, window=8, min_count=5, workers=4)\n",
    "model.build_vocab(iter_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['长江',\n",
       "  '是',\n",
       "  '中国',\n",
       "  '第一',\n",
       "  '大河',\n",
       "  '，',\n",
       "  '干流',\n",
       "  '全长',\n",
       "  '6397',\n",
       "  '公里',\n",
       "  '（',\n",
       "  '以',\n",
       "  '沱沱河',\n",
       "  '为源',\n",
       "  '）',\n",
       "  '，',\n",
       "  '一般',\n",
       "  '称',\n",
       "  '6300',\n",
       "  '公里',\n",
       "  '流域',\n",
       "  '总面积',\n",
       "  '一百八十',\n",
       "  '余万平方公里',\n",
       "  '，',\n",
       "  '年',\n",
       "  '平均',\n",
       "  '入海',\n",
       "  '水量',\n",
       "  '约',\n",
       "  '九千',\n",
       "  '六百余',\n",
       "  '亿立方米',\n",
       "  '以',\n",
       "  '干流',\n",
       "  '长度',\n",
       "  '和',\n",
       "  '入海',\n",
       "  '水量',\n",
       "  '论',\n",
       "  '，',\n",
       "  '长江',\n",
       "  '均',\n",
       "  '居',\n",
       "  '世界',\n",
       "  '第三位'],\n",
       " ['黄河',\n",
       "  '，',\n",
       "  '中国',\n",
       "  '古代',\n",
       "  '也',\n",
       "  '称河',\n",
       "  '，',\n",
       "  '发源',\n",
       "  '于',\n",
       "  '中华人民共和国',\n",
       "  '青海省',\n",
       "  '巴颜喀拉山',\n",
       "  '脉',\n",
       "  '，',\n",
       "  '流经',\n",
       "  '青海',\n",
       "  '、',\n",
       "  '四川',\n",
       "  '、',\n",
       "  '甘肃',\n",
       "  '、',\n",
       "  '宁夏',\n",
       "  '、',\n",
       "  '内蒙古',\n",
       "  '、',\n",
       "  '陕西',\n",
       "  '、',\n",
       "  '山西',\n",
       "  '、',\n",
       "  '河南',\n",
       "  '、',\n",
       "  '山东',\n",
       "  '9',\n",
       "  '个',\n",
       "  '省区',\n",
       "  '，',\n",
       "  '最后',\n",
       "  '于',\n",
       "  '山东省',\n",
       "  '东营',\n",
       "  '垦利县',\n",
       "  '注入',\n",
       "  '渤海',\n",
       "  '干流',\n",
       "  '河道',\n",
       "  '全长',\n",
       "  '5464',\n",
       "  '千米',\n",
       "  '，',\n",
       "  '仅次于',\n",
       "  '长江',\n",
       "  '，',\n",
       "  '为',\n",
       "  '中国',\n",
       "  '第二',\n",
       "  '长河',\n",
       "  '黄河',\n",
       "  '还是',\n",
       "  '世界',\n",
       "  '第五',\n",
       "  '长河'],\n",
       " ['黄河',\n",
       "  '是',\n",
       "  '中华民族',\n",
       "  '的',\n",
       "  '母亲河',\n",
       "  '作为',\n",
       "  '中华文明',\n",
       "  '的',\n",
       "  '发祥地',\n",
       "  '维系',\n",
       "  '炎黄子孙',\n",
       "  '的',\n",
       "  '血脉',\n",
       "  '是',\n",
       "  '中华民族',\n",
       "  '民族',\n",
       "  '精神',\n",
       "  '与',\n",
       "  '民族',\n",
       "  '情感',\n",
       "  '的',\n",
       "  '象征'],\n",
       " ['黄河',\n",
       "  '被',\n",
       "  '称为',\n",
       "  '中华文明',\n",
       "  '的',\n",
       "  '母亲河',\n",
       "  '公元前',\n",
       "  '2000',\n",
       "  '多年',\n",
       "  '华夏',\n",
       "  '族',\n",
       "  '在',\n",
       "  '黄河',\n",
       "  '领域',\n",
       "  '的',\n",
       "  '中原地区',\n",
       "  '形成',\n",
       "  '、',\n",
       "  '繁衍'],\n",
       " ['在',\n",
       "  '兰州',\n",
       "  '的',\n",
       "  '“',\n",
       "  '黄河',\n",
       "  '第一',\n",
       "  '桥',\n",
       "  '”',\n",
       "  '内蒙古',\n",
       "  '托克托县',\n",
       "  '河口镇',\n",
       "  '以上',\n",
       "  '的',\n",
       "  '黄河',\n",
       "  '河段',\n",
       "  '为',\n",
       "  '黄河',\n",
       "  '上游'],\n",
       " ['黄河',\n",
       "  '上游',\n",
       "  '根据',\n",
       "  '河道',\n",
       "  '特性',\n",
       "  '的',\n",
       "  '不同',\n",
       "  '，',\n",
       "  '又',\n",
       "  '可',\n",
       "  '分为',\n",
       "  '河源',\n",
       "  '段',\n",
       "  '、',\n",
       "  '峡谷',\n",
       "  '段',\n",
       "  '和',\n",
       "  '冲积平原',\n",
       "  '三',\n",
       "  '部分',\n",
       "  ' '],\n",
       " ['黄河', '是', '中华民族', '的', '母亲河']]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zhangjianfeng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:10: DeprecationWarning: Call to deprecated `LabeledSentence` (Class will be removed in 4.0.0, use TaggedDocument instead).\n",
      "  # Remove the CWD from sys.path while we load stuff.\n"
     ]
    }
   ],
   "source": [
    "model.train(iter_data,total_examples=model.corpus_count,epochs=1000,start_alpha=0.01,end_alpha =0.001)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('长江', 0.9885953664779663)]\n"
     ]
    }
   ],
   "source": [
    "#根据标签找最相似的，这里只有黄河和长江，所以结果为长江，并计算出了相似度\n",
    "print(model.docvecs.most_similar('黄河'))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9885955\n"
     ]
    }
   ],
   "source": [
    "print(model.docvecs.similarity('黄河','长江'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'vec' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-34-ff5fd59699e4>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msvm\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mSVC\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0msvm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSVC\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkernel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'linear'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0msvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscore\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mNameError\u001b[0m: name 'vec' is not defined"
     ]
    }
   ],
   "source": [
    "from sklearn.svm import SVC\n",
    "svm = SVC(kernel='linear')\n",
    "svm.fit(vec.transform(x_train), y_train)\n",
    "print(svm.score(vec.transform(x_test), y_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
